In this report, we investigate New York City Taxi and Limousine Commission trip records for “green” taxis. Green taxis (as opposed to yellow ones) are not allowed to pick up passengers in the densely populated core of Manhattan. We use NYC Taxi and Limousine trip record data from September 2015: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
The data challenge is split into 5 questions. Our analysis report will follow the questions step by step through the business understanding, data understanding, data preparation, modeling, and evaluation processes. All exploration, analysis, and modeling of the dataset are performed in RStudio. The visualization packages are ggplot2 and plotly.
Before starting, we first clear the workspace and load all required packages.
#### Load the packages
#Clear the space
rm(list=ls())
#Load required packages
library('data.table') #data manipulation
library('dplyr') #pipeline operator
library('ggplot2') #visualization
library('curl') #R interface to libcurl
library('lubridate') #deal with time-related variables
library('scales') #Graphical scales map
library('caret') #machine learning package
library('stringr') #deal with text variable
library('corrplot') #heatmap of correlation matrix
library('glmnet') #Lasso model
library('randomForest') #Random forest
library('maps') #map function
library('xgboost') #Xgboosting
library('zoo') #time related variable
library('VIM') #Missing value
library('rgdal') #Map location information
library('gridExtra') # grid layouts
library('pastecs') # details summary stats
library('gmodels') # build contingency tables
library('ggfortify') #PCA
library('plotly') #visualization
Question 1:
Load dataset
setwd("~/Downloads/ZillowNeighborhoods-NY")
temp <- tempfile()
download.file("https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv", temp)
rawdata <- read.csv(temp,na.strings = '',stringsAsFactors = F)
rm(temp)
Glimpse the dataset
head(rawdata)
The detailed description of column variables can be found in the link below: http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf
Check structure and dimension of the dataset
#Data structure
str(rawdata)
## 'data.frame': 1494926 obs. of 21 variables:
## $ VendorID : int 2 2 2 2 2 2 2 2 2 2 ...
## $ lpep_pickup_datetime : chr "2015-09-01 00:02:34" "2015-09-01 00:04:20" "2015-09-01 00:01:50" "2015-09-01 00:02:36" ...
## $ Lpep_dropoff_datetime: chr "2015-09-01 00:02:38" "2015-09-01 00:04:24" "2015-09-01 00:04:24" "2015-09-01 00:06:42" ...
## $ Store_and_fwd_flag : chr "N" "N" "N" "N" ...
## $ RateCodeID : int 5 5 1 1 1 1 1 1 1 1 ...
## $ Pickup_longitude : num -74 -74 -73.9 -73.9 -74 ...
## $ Pickup_latitude : num 40.7 40.9 40.8 40.8 40.7 ...
## $ Dropoff_longitude : num -74 -74 -73.9 -73.9 -73.9 ...
## $ Dropoff_latitude : num 40.7 40.9 40.8 40.8 40.7 ...
## $ Passenger_count : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Trip_distance : num 0 0 0.59 0.74 0.61 1.07 1.43 0.9 1.33 0.84 ...
## $ Fare_amount : num 7.8 45 4 5 5 5.5 6.5 5 6 5.5 ...
## $ Extra : num 0 0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ MTA_tax : num 0 0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ Tip_amount : num 1.95 0 0.5 0 0 1.36 0 0 1.46 0 ...
## $ Tolls_amount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Ehail_fee : logi NA NA NA NA NA NA ...
## $ improvement_surcharge: num 0 0 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ Total_amount : num 9.75 45 5.8 6.3 6.3 8.16 7.8 6.3 8.76 6.8 ...
## $ Payment_type : int 1 1 1 2 2 1 1 2 1 2 ...
## $ Trip_type : int 2 2 1 1 1 1 1 1 1 1 ...
#check whether there are duplicated records
which(duplicated(rawdata))
## integer(0)
#The size of dataset
dim(rawdata)
## [1] 1494926 21
There are no duplicate records in the green taxi dataset from September 2015. The dataset has 1494926 rows and 21 columns.
Question 2:
Data Checking
#Check the missing values
length(which(is.na(rawdata$Trip_distance)))
## [1] 0
#Check negative value
length(which(rawdata$Trip_distance<0))
## [1] 0
#Summary of the variable
summary(rawdata$Trip_distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.100 1.980 2.968 3.740 603.100
The trip distance has no missing or negative values. It ranges from 0 to 603.1 miles. However, the mean and median are both smaller than 3 miles, which indicates that outliers exist in the raw dataset.
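Besides eyeballing the summary, a conventional 1.5 × IQR rule could flag such outliers. A minimal sketch on simulated distances standing in for rawdata$Trip_distance (this report instead uses a domain-based 20-mile cutoff below):

```r
# Sketch: flag trip-distance outliers with the 1.5*IQR rule.
# Simulated distances stand in for rawdata$Trip_distance.
set.seed(1)
trip_distance <- c(rlnorm(1000, meanlog = 0.7, sdlog = 0.8), 603.1)
q <- quantile(trip_distance, c(0.25, 0.75))
upper <- unname(q[2] + 1.5 * (q[2] - q[1]))  # upper fence
outliers <- trip_distance[trip_distance > upper]
```

On the real data the fences would come from the quartiles shown in the summary above; a fixed 20-mile cutoff is easier to interpret for taxi trips.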
Histogram of trip distance
Let’s have a look at the distribution of trip distance in raw dataset (Left figure).
#histogram for raw dataset
p1<-ggplot(rawdata, aes(x=Trip_distance)) +
geom_histogram(color="black", fill="red",bins=100,alpha=0.5)
#calculate % of trip distances within 20 miles
dim(rawdata[rawdata$Trip_distance<20,])[1]/dim(rawdata)[1]
## [1] 0.9977297
#generate subset
sub_data<-rawdata%>%filter(Trip_distance<20)
#histogram for trip distance<20
p2<-ggplot(sub_data, aes(x=Trip_distance)) +
geom_histogram(color="black", fill="red",bins=25,alpha=0.5)
grid.arrange(p1,p2, ncol=2)
Seen from the histogram on the left, 99.77% of the trip distances are within 20 miles. Trip distances larger than 20 miles are considered outliers. We plot the histogram of trip distances after removing outliers. Even then, the distribution of trip distance is still heavily right-skewed (right figure). The majority of trips are short distances (< 5 miles).
The hypothesis: the trip distances are asymmetrically distributed and appear to follow a lognormal distribution.
To confirm our hypothesis, we take the logarithm of the trip distance and then fit a normal distribution to its density after removing outliers. As shown below, after the log-transformation the distribution of log(trip distance) is approximately symmetric. Note that there are 20592 zero-distance trips; they are dropped when plotting because log(0) is negative infinity.
sub_data<-rawdata%>%filter(Trip_distance<20)%>%mutate(Log_TripDistance=log(Trip_distance))
p3<-ggplot(sub_data, aes(x=Log_TripDistance)) +
geom_histogram(aes(y=..density..), colour="black", fill="red")+geom_density(alpha=.2, fill="blue")
p3
The QQ plot is also displayed below. The most serious deviations from normality are induced by the near-zero values. For the majority of observations, we can assume approximate normality, which also supports our hypothesis.
ggplot(sub_data, aes(sample=Log_TripDistance))+stat_qq()
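The visual checks can be complemented with a formal fit. A minimal sketch using simulated lognormal data as a stand-in for sub_data$Trip_distance (zero distances must be excluded first):

```r
set.seed(1)
# Simulated positive trip distances as a stand-in for the real data
d <- rlnorm(10000, meanlog = 0.7, sdlog = 0.9)
fit <- MASS::fitdistr(log(d), "normal")  # fit a normal to the log-distances
fit$estimate                             # mean/sd recovered near 0.7 and 0.9
```

If the fitted normal describes log(Trip_distance) well, the original distances are approximately lognormal. Note that MASS::fitdistr is called with the `::` prefix because attaching MASS would mask dplyr::select in the rest of the workflow.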
Question 3:
#Extract hour information and generate a new variable 'Hour_day'.
time<-strftime(rawdata$lpep_pickup_datetime, format="%H:%M:%S")
rawdata<-rawdata%>%mutate(Hour_day=as.numeric(substr(time,1,2)))
#Calculate mean and median trip distance grouped by hour of day
mean_median = rawdata%>%group_by(Hour_day) %>% summarise(mean_distance = round(mean(Trip_distance),2),
median_distance = round(median(Trip_distance),2))
print(mean_median)
## # A tibble: 24 x 3
## Hour_day mean_distance median_distance
## <dbl> <dbl> <dbl>
## 1 0 3.12 2.2
## 2 1 3.02 2.12
## 3 2 3.05 2.14
## 4 3 3.21 2.2
## 5 4 3.53 2.36
## 6 5 4.13 2.9
## 7 6 4.06 2.84
## 8 7 3.28 2.17
## 9 8 3.05 1.98
## 10 9 3 1.96
## # ... with 14 more rows
ggplot(mean_median, aes(Hour_day)) + ggtitle("Mean and Median of Trip Distance by Hour of Day")+
geom_line(aes(y = mean_distance, colour = "Mean of trip distance")) +
geom_line(aes(y = median_distance, colour = "Median of trip distance"))+ylab("Trip Distance")+xlab('Hour of Day')
Seen from the figure above, the mean and median trip distances show similar trends. The peak occurs in the morning between 5:00 AM and 6:00 AM, possibly because people take taxis to avoid being late for work or to catch flights. The minimum mean and median appear between 6:00 PM and 7:00 PM; a possible reason is that people take public transportation instead of taxis in the evening.
Next we identify trips that originate or terminate at one of the NYC area airports. The variable RateCodeID indicates the final rate that was applied (JFK = 2, Newark = 3). JFK and Newark are two of the main airports serving NYC.
#Calculate transactions,average fare and total amount for airports
transaction = rawdata%>% filter(RateCodeID == 2|RateCodeID == 3) %>%
summarise(n = n(),ave_fare=mean(Fare_amount,na.rm=T),ave_total=mean(Total_amount,na.rm=T))
print(transaction)
## n ave_fare ave_total
## 1 5552 48.97695 57.20842
A. Compare fare amount and total amount between JFK and Newark
There are two local airports in the New York region in this dataset: JFK and Newark. For the variable RateCodeID, 2 = JFK and 3 = Newark. We want to examine whether trips to/from the two airports differ in fare amount and total amount. Before the comparison, we remove outliers with negative fare amount or total amount; 2417 such transactions were removed.
#Generate dataset after removing outliers
dta<-rawdata%>% filter(Fare_amount>=0,Total_amount>=0)
#Calculate transactions,average fare and total amount for JFK
transaction_JFK = dta%>% filter(RateCodeID == 2) %>%
summarise(n = n(),ave_fare=mean(Fare_amount,na.rm=T),ave_total=mean(Total_amount,na.rm=T))
print(transaction_JFK)
## n ave_fare ave_total
## 1 4317 51.78318 59.52107
#Calculate transactions,average fare and total amount for Newark
transaction_Newark = dta%>% filter(RateCodeID == 3) %>%
summarise(n = n(),ave_fare=mean(Fare_amount,na.rm=T),ave_total=mean(Total_amount,na.rm=T))
print(transaction_Newark)
## n ave_fare ave_total
## 1 1093 50.32571 61.68318
#Generate subset for JFK and Newark respectively
Dat_JFK<-dta%>%filter(RateCodeID==2)
Dat_Newark<-dta%>%filter(RateCodeID==3)
#Welch t test for JFK and Newark in fare and total amount
t.test(Dat_JFK$Fare_amount,Dat_Newark$Fare_amount)
##
## Welch Two Sample t-test
##
## data: Dat_JFK$Fare_amount and Dat_Newark$Fare_amount
## t = 1.5923, df = 1098.8, p-value = 0.1116
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3385571 3.2535045
## sample estimates:
## mean of x mean of y
## 51.78318 50.32571
t.test(Dat_JFK$Total_amount,Dat_Newark$Total_amount)
##
## Welch Two Sample t-test
##
## data: Dat_JFK$Total_amount and Dat_Newark$Total_amount
## t = -1.7426, df = 1125.3, p-value = 0.08167
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.596464 0.272241
## sample estimates:
## mean of x mean of y
## 59.52107 61.68318
JFK: There are 4317 transactions. The average fare is $51.78 per trip; the average total amount is $59.52 per trip.
Newark: There are 1093 transactions. The average fare is $50.33 per trip; the average total amount is $61.68 per trip.
The Welch t-tests show no significant difference between JFK and Newark in average fare amount or total amount (p-value > 0.05).
B. Compare trip distance distribution between JFK and Newark
Next we look at whether the trip distance distributions of JFK and Newark differ.
#plot distribution of trip distance between JFK and Newark
Airport<-c(rep("JFK",length(which(dta$RateCodeID==2))),rep("Newark",length(which(dta$RateCodeID==3))))
Airport_TripDis<-data.frame(c(Dat_JFK$Trip_distance,Dat_Newark$Trip_distance),Airport)
names(Airport_TripDis)<-c("Trip_Distance","Airport")
ggplot(Airport_TripDis,aes(x=Trip_Distance,fill=Airport))+
geom_histogram(aes(y=..density..),position='identity',alpha=0.5,binwidth=1)
Both JFK and Newark trips follow the same trend: the majority are short trips (trip distance < 5 miles). In addition, there is a peak of long-range JFK trips (> 15 miles), which may correspond to a large number of people coming to JFK from more distant areas.
C. Compare hourly distribution of trips between JFK and Newark
We are also interested in the hourly distribution of trips between JFK and Newark.
Airport_all<-data.frame(rbind(Dat_JFK,Dat_Newark),Airport)
#Generate freq table group by airport and hour of day.
Airport_all_hour = Airport_all %>%
group_by(Airport,Hour_day) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
#Plot percentage barchart for JFK and Newark
ggplot(data=Airport_all_hour, aes(x = Hour_day, y=freq*100, fill=Airport)) +
geom_bar(stat="identity",position="dodge") + ylab("Percentage of Trips") +xlab("Hour of Day")
The hourly distribution shows that the number of trips at both airports peaks around 3 PM. On the other hand, the number of trips to/from the airports reaches its minimum around 2 AM.
Similarly, we have a look at the differences between airport trips and non-airport trips.
A. Compare fare amount and total amount between airport trips and non-airport trips
#Generate subsets for airport trips and non-airport trips
airtrips<-dta%>% filter(RateCodeID == 2|RateCodeID == 3)%>%mutate(Trip='Airport')
non_airtrips<-dta%>% filter(RateCodeID != 2,RateCodeID != 3)%>%mutate(Trip='Non-Airport')
trip<-rbind(airtrips,non_airtrips)
#Calculate transactions,average fare and total amount for airport trips
transaction_air = airtrips%>%
summarise(n = n(),ave_fare=mean(Fare_amount,na.rm=T),ave_total=mean(Total_amount,na.rm=T))
print(transaction_air)
## n ave_fare ave_total
## 1 5410 51.48872 59.95789
#Calculate transactions,average fare and total amount for non airport trips
transaction_noair= non_airtrips %>%
summarise(n = n(),ave_fare=mean(Fare_amount,na.rm=T),ave_total=mean(Total_amount,na.rm=T))
print(transaction_noair)
## n ave_fare ave_total
## 1 1487099 12.43861 14.9117
#Welch t tests for airport vs non-airport trips in fare and total amount
t.test(airtrips$Fare_amount,non_airtrips$Fare_amount)
##
## Welch Two Sample t-test
##
## data: airtrips$Fare_amount and non_airtrips$Fare_amount
## t = 206.24, df = 5428.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 38.67892 39.42130
## sample estimates:
## mean of x mean of y
## 51.48872 12.43861
t.test(airtrips$Total_amount,non_airtrips$Total_amount)
##
## Welch Two Sample t-test
##
## data: airtrips$Total_amount and non_airtrips$Total_amount
## t = 162.65, df = 5420.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 44.50325 45.58913
## sample estimates:
## mean of x mean of y
## 59.95789 14.91170
Airport trips: There are 5410 transactions. The average fare is $51.49 per trip; the average total amount is $59.96 per trip.
Non-airport trips: There are 1487099 transactions. The average fare is $12.44 per trip; the average total amount is $14.91 per trip.
The Welch t-tests show that the fare amount and total amount of airport trips are significantly higher than those of non-airport trips on average (p-value < 0.05). Non-airport trips are usually short, whereas airport trips are long, since the distance between either of the two airports and Manhattan is over 10 miles. In the next section, we further explore the trip distance distributions of airport and non-airport trips.
B. Compare trip distance distribution between airport trips and non-airport trips
As discussed in Question 2, the trip distances larger than 20 miles are considered outliers.
#Remove outliers
trip<-trip[trip$Trip_distance<=20,]
#Plot density figure for airport trips and non-airport trips
ggplot(trip,aes(x=Trip_distance,fill=Trip))+
geom_histogram(aes(y=..density..),position='identity',alpha=0.5,binwidth=1)
There are two peaks here. The majority of trips are short (trip distance < 2 miles). However, there is a peak of long-range airport trips (> 15 miles), corresponding to people traveling between the airports and downtown.
C. Compare hourly distribution of trips between airport trips and non-airport trips
#Generate freq table group by trip type and hour of day.
trip_hour = trip %>%
group_by(Trip,Hour_day) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
#Plot percentage barchart for airport and non-airport trips
ggplot(data=trip_hour, aes(x = Hour_day, y=freq*100, fill=Trip)) +
geom_bar(stat="identity",position="dodge") + ylab("Percentage of Trips") +xlab("Hour of Day")
The hourly distribution shows that the number of airport trips peaks around 3PM. On the other hand, the number of non-airport trips peaks at about 5PM.
Question 4:
Since the initial charge for a green taxi in NYC is $2.50 (http://www.nyc.gov/html/tlc/html/passenger/taxicab_rate.shtml), any transaction with a total amount smaller than $2.50 will be removed.
Generate derived variables
#Clear workspace
rm(Airport,Airport_all,Airport_all_hour,Airport_TripDis,airtrips,Dat_JFK,Dat_Newark,dta,
non_airtrips,p1,p2,p3,time,transaction_air,transaction_JFK,transaction_Newark,transaction_noair,trip,trip_hour)
#Create percentage of tip(%) variable
rawdata<-rawdata%>%filter(Total_amount>=2.5)%>%mutate(Tip_Per=(Tip_amount/Total_amount)*100)
#Distribution of Tip percentage(Tip_Per(%))
p<-ggplot(rawdata, aes(x=Tip_Per)) +
geom_histogram(color="black", fill="red")+xlab("Tip(%)")+theme_gray()
p
#Check % of no tip trips
sum(rawdata$Tip_Per==0)/dim(rawdata)[1]
## [1] 0.5948821
“Tip_Per” shows that about 59.5% of all transactions did not include a tip. This is a typical zero-inflated dataset. If we use a single model to predict all tip percentages, the large number of zero values will hurt model performance and bias the final predictions.
One way to solve this problem is to build a tree-based regression-type model, which can be considered a combination of a decision tree and a regression-type model. The general modeling strategy is as follows:
1. After data cleaning, we create a variable (Tip_give) indicating whether a tip was given.
2. During data exploration and feature engineering, we identify one feature as the node of a shallow decision tree. The feature should have the strongest association with Tip_give from step 1 and should also be explainable by reasonable business insight (i.e., domain knowledge).
3. After splitting the modeling dataset into train/test sets, the training set is split into two subgroups according to the tree structure defined in step 2. In general, trips in the first subgroup should have almost all 0% tips, and the majority of trips in the second subgroup should have non-zero tips.
4. The predictive model is trained only on the second subgroup; for every trip in the first subgroup, we predict a tip percentage of 0.
We will compare the tree-based regression model with a baseline linear regression model on the test set to demonstrate the advantage of this integrated method. More details on the advantages of this tree-based framework are given in the final discussion section of this question.
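The modeling strategy above amounts to a hurdle-style model: one rule decides zero vs non-zero tip, and a regressor is trained only on the tipping subgroup. A minimal sketch on simulated data with a hypothetical split feature pay_card (in the real analysis the split feature is chosen during feature engineering, not assumed here):

```r
set.seed(42)
n <- 2000
# Simulated trips: card payers usually tip, others almost never do
pay_card <- rbinom(n, 1, 0.5)
fare     <- runif(n, 3, 50)
tip_per  <- ifelse(pay_card == 1, pmax(rnorm(n, 18, 5), 0), 0)
dta      <- data.frame(pay_card, fare, tip_per)

# Split on the feature most associated with tipping
grp_nonzero <- dta[dta$pay_card == 1, ]

# Regression trained only on the tipping subgroup;
# the non-tipping subgroup is predicted as 0
fit  <- lm(tip_per ~ fare, data = grp_nonzero)
pred <- ifelse(dta$pay_card == 0, 0, predict(fit, newdata = dta))
```

The point of the split is that the zero-heavy subgroup never distorts the regression fit, which is exactly the bias the single-model approach suffers from.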
Step 1: Check missing values
#Summarize and plot missing values
mice_plot <- aggr(rawdata, col=c('navyblue','yellow'),
numbers=TRUE, sortVars=TRUE,
labels=names(rawdata), cex.axis=.7,
gap=1, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Ehail_fee 1.000000e+00
## Trip_type 2.688593e-06
## VendorID 0.000000e+00
## lpep_pickup_datetime 0.000000e+00
## Lpep_dropoff_datetime 0.000000e+00
## Store_and_fwd_flag 0.000000e+00
## RateCodeID 0.000000e+00
## Pickup_longitude 0.000000e+00
## Pickup_latitude 0.000000e+00
## Dropoff_longitude 0.000000e+00
## Dropoff_latitude 0.000000e+00
## Passenger_count 0.000000e+00
## Trip_distance 0.000000e+00
## Fare_amount 0.000000e+00
## Extra 0.000000e+00
## MTA_tax 0.000000e+00
## Tip_amount 0.000000e+00
## Tolls_amount 0.000000e+00
## improvement_surcharge 0.000000e+00
## Total_amount 0.000000e+00
## Payment_type 0.000000e+00
## Hour_day 0.000000e+00
## Tip_Per 0.000000e+00
#remove and impute missing values
rawdata<-rawdata[,!names(rawdata)%in%c('Ehail_fee')]
rawdata$Trip_type[is.na(rawdata$Trip_type)]<-1
Only two variables contain missing values. Ehail_fee is 100% missing, so we delete it from the dataset. The missing values of Trip_type (fewer than 0.001% of rows) are replaced with the most common value, 1.
Step 2: Data Checking based on data description
There are four types of variables in this dataset: categorical, continuous, time, and location.
A. Categorical variables: VendorID,Store_and_fwd_flag,RateCodeID,Payment_type,Trip_type
#Check whether VendorID are coded as 1 and 2
table(rawdata$VendorID)
##
## 1 2
## 322149 1165618
#Check whether Store_and_fwd_flag are coded as Y and N
table(rawdata$Store_and_fwd_flag)
##
## N Y
## 1479198 8569
#Check whether Rate_ID are coded as 1-6
table(rawdata$RateCodeID)
##
## 1 2 3 4 5 6 99
## 1451627 4299 1091 912 29802 32 4
#Rate_ID: there are 4 trips coded as 99 (<0.001%); we replace them with the most common value 1
rawdata$RateCodeID[which(rawdata$RateCodeID==99)]<-1
#Check whether Payment_type are coded as 1-6
table(rawdata$Payment_type)
##
## 1 2 3 4 5
## 700934 779391 3922 3448 72
#Transform to categorical variables
factors_trans <- function(dta, variables){
for (variable in variables){
dta[[variable]] <- as.factor(dta[[variable]])
}
return(dta)
}
# select variables for data transformation
categorical_var <- c('VendorID','Store_and_fwd_flag','RateCodeID','Payment_type','Trip_type')
# transform data types
rawdata <- factors_trans(dta = rawdata,
variables=categorical_var)
# verify transformation in data frame details
str(rawdata)
## 'data.frame': 1487767 obs. of 22 variables:
## $ VendorID : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ lpep_pickup_datetime : chr "2015-09-01 00:02:34" "2015-09-01 00:04:20" "2015-09-01 00:01:50" "2015-09-01 00:02:36" ...
## $ Lpep_dropoff_datetime: chr "2015-09-01 00:02:38" "2015-09-01 00:04:24" "2015-09-01 00:04:24" "2015-09-01 00:06:42" ...
## $ Store_and_fwd_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ RateCodeID : Factor w/ 6 levels "1","2","3","4",..: 5 5 1 1 1 1 1 1 1 1 ...
## $ Pickup_longitude : num -74 -74 -73.9 -73.9 -74 ...
## $ Pickup_latitude : num 40.7 40.9 40.8 40.8 40.7 ...
## $ Dropoff_longitude : num -74 -74 -73.9 -73.9 -73.9 ...
## $ Dropoff_latitude : num 40.7 40.9 40.8 40.8 40.7 ...
## $ Passenger_count : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Trip_distance : num 0 0 0.59 0.74 0.61 1.07 1.43 0.9 1.33 0.84 ...
## $ Fare_amount : num 7.8 45 4 5 5 5.5 6.5 5 6 5.5 ...
## $ Extra : num 0 0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ MTA_tax : num 0 0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ Tip_amount : num 1.95 0 0.5 0 0 1.36 0 0 1.46 0 ...
## $ Tolls_amount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ improvement_surcharge: num 0 0 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
## $ Total_amount : num 9.75 45 5.8 6.3 6.3 8.16 7.8 6.3 8.76 6.8 ...
## $ Payment_type : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 2 2 1 1 2 1 2 ...
## $ Trip_type : Factor w/ 2 levels "1","2": 2 2 1 1 1 1 1 1 1 1 ...
## $ Hour_day : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Tip_Per : num 20 0 8.62 0 0 ...
B. Continuous variables:Passenger_count,Trip_distance,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,improvement_surcharge,Total_amount
#Remove observations that have 0 passenger counts
rawdata<-rawdata%>%filter(Passenger_count>0)
#Remove observations with trip distance = 0 or > 50 miles (outliers; only 67 observations)
rawdata<-rawdata%>%filter(Trip_distance>0,Trip_distance<=50)
#Check whether amount has negative values or near zero variance
amount_ind<-c('Fare_amount','Extra','MTA_tax','Tip_amount','Tolls_amount',
'improvement_surcharge','Total_amount')
for (i in 1:length(amount_ind))
{
negative<-sum(as.numeric(rawdata[,amount_ind[[i]]])<0)
variance<-var(as.numeric(rawdata[,amount_ind[[i]]]))
print(paste0("The number of negative values of",sep=' ',amount_ind[[i]],'=',negative))
print(paste0("The variance of",sep=' ',amount_ind[[i]],'=',variance))
}
## [1] "The number of negative values of Fare_amount=0"
## [1] "The variance of Fare_amount=89.2860816637185"
## [1] "The number of negative values of Extra=1"
## [1] "The variance of Extra=0.133296125715707"
## [1] "The number of negative values of MTA_tax=0"
## [1] "The variance of MTA_tax=0.00395816861685441"
## [1] "The number of negative values of Tip_amount=0"
## [1] "The variance of Tip_amount=5.27887856683456"
## [1] "The number of negative values of Tolls_amount=0"
## [1] "The variance of Tolls_amount=0.788466421326487"
## [1] "The number of negative values of improvement_surcharge=0"
## [1] "The variance of improvement_surcharge=0.00140554156290991"
## [1] "The number of negative values of Total_amount=0"
## [1] "The variance of Total_amount=119.710569960655"
##Remove negative value of Extra variable (only 1)
rawdata<-rawdata%>%filter(Extra>=0)
C. Time variables:lpep_pickup_datetime,Lpep_dropoff_datetime
It is hard to use the pick-up/drop-off timestamps directly. Thus we transform the date into hour of day, weekday, a weekend indicator, and week of the month for September 2015.
#Convert to date type
rawdata<-rawdata%>%select(-Hour_day)
rawdata <- rawdata %>% mutate(lpep_pickup_datetime = ymd_hms(lpep_pickup_datetime),
pickup_hour=hour(lpep_pickup_datetime),
pickup_weekday=as.factor(weekdays(lpep_pickup_datetime)),
pickup_weekend=if_else(pickup_weekday=='Saturday'|pickup_weekday=='Sunday','Weekend','Weekday'),
Lpep_dropoff_datetime = ymd_hms( Lpep_dropoff_datetime),
dropoff_hour=hour(Lpep_dropoff_datetime),
dropoff_weekday=as.factor(weekdays(Lpep_dropoff_datetime)),
dropoff_weekend=if_else(dropoff_weekday=='Saturday'|dropoff_weekday=='Sunday','Weekend','Weekday')
)
#Transform to y/m/d date format
Date<-as.Date(rawdata$lpep_pickup_datetime,format="%Y-%m-%d")
#Extract day from date
Day<-as.numeric(substr(Date,9,10))
#Create week variable (week=1-5) from pick up time
Week<-array()
Week[which((Day>=1)&(Day<=6))]<-1
Week[which((Day>=7)&(Day<=13))]<-2
Week[which((Day>=14)&(Day<=20))]<-3
Week[which((Day>=21)&(Day<=27))]<-4
Week[which((Day>=28)&(Day<=30))]<-5
rawdata$week<-Week
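The five range assignments above can equivalently be written with a single cut() over the same day bins; a quick sketch:

```r
# Same bins as above: days 1-6, 7-13, 14-20, 21-27, 28-30
Day  <- 1:30  # all days in September 2015
Week <- as.numeric(cut(Day, breaks = c(0, 6, 13, 20, 27, 30),
                       labels = 1:5))
```

cut() uses right-closed intervals by default, so the breaks c(0, 6, 13, 20, 27, 30) reproduce the bin edges exactly.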
The weekday, weekend, and week variables were created under the hypothesis that willingness to tip may depend on the day of the week or the time of day. Although we created drop-off variables, we will use the pick-up time in model building.
D. Location variables:Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude
It is very hard to use longitude and latitude directly. Instead, we use the districts in which the pick-up coordinates fall, identified via the Zillow NYC area shapefiles. The link is as follows: https://www.zillow.com/howto/api/neighborhood-boundaries.htm
# Read the neighborhood shapefile (requires rgdal)
Map_NYC <- readOGR("ZillowNeighborhoods-NY.shp", layer="ZillowNeighborhoods-NY")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/weiwei.ouyang/Downloads/ZillowNeighborhoods-NY/ZillowNeighborhoods-NY.shp", layer: "ZillowNeighborhoods-NY"
## with 579 features
## It has 5 fields
Map_Filter = Map_NYC[Map_NYC$City == 'New York', ]
Location_Data_Pickup <- data.frame(Longitude = rawdata$Pickup_longitude,
Latitude = rawdata$Pickup_latitude)
Location_Data_Dropoff <- data.frame(Longitude = rawdata$Dropoff_longitude,
Latitude = rawdata$Dropoff_latitude)
# Identify the pickup location
coordinates(Location_Data_Pickup) <- ~ Longitude + Latitude
proj4string(Location_Data_Pickup) <- proj4string(Map_Filter)
County_Pickup = as.character(over(Location_Data_Pickup, Map_Filter)$County)
# Identify the dropoff location
coordinates(Location_Data_Dropoff) <- ~ Longitude + Latitude
proj4string(Location_Data_Dropoff) <- proj4string(Map_Filter)
County_Dropoff = as.character(over(Location_Data_Dropoff, Map_Filter)$County)
#Generate county variable
rawdata$County_Pickup = County_Pickup
rawdata$County_Dropoff = County_Dropoff
# Replace missing values with 'Other'
rawdata$County_Pickup[is.na(rawdata$County_Pickup)] = 'Other'
rawdata$County_Dropoff[is.na(rawdata$County_Dropoff)] = 'Other'
rm(Location_Data_Dropoff,Location_Data_Pickup)
We will only use the pick-up county in the next steps.
Feature Transformation and Feature Engineering
We create Trip duration (Duration) and speed (Speed) variables.
#Generate modeling dataset
modeldata<-rawdata
###Transform time-related variables to categorical variables
modeldata$pickup_weekend<-as.factor(modeldata$pickup_weekend)
modeldata$dropoff_weekend<-as.factor(modeldata$dropoff_weekend)
modeldata$week<-as.factor(modeldata$week)
modeldata$pickup_hour<-as.factor(modeldata$pickup_hour)
###Transform location variable to categorical variable
modeldata$County_Pickup<-as.factor(modeldata$County_Pickup)
##Transform Extra, MTA_tax and improvement_surcharge to categorical variables
modeldata$Extra<-as.factor(modeldata$Extra)
modeldata$MTA_tax<-as.factor(modeldata$MTA_tax)
modeldata$improvement_surcharge<-as.factor(modeldata$improvement_surcharge)
#Create trip duration (min); request minutes explicitly so difftime cannot silently pick another unit
modeldata$Duration<-as.numeric(difftime(modeldata$Lpep_dropoff_datetime,modeldata$lpep_pickup_datetime,units="mins"))
#Remove trips with non-positive duration (141 trips removed)
modeldata<-modeldata%>%filter(Duration>0)
#Create speed (mile per hour)
modeldata$Speed<-(modeldata$Trip_distance)/(modeldata$Duration/60)
#For speed >200 mile per hour, we replace with the mean speed
modeldata[modeldata$Speed>200,]$Speed<-mean(modeldata[modeldata$Speed<=200,]$Speed)
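Mean-replacement is one option for the implausible speeds; an alternative is to cap (winsorize) at the threshold, which preserves the ordering of values. A tiny sketch with illustrative speeds:

```r
speed <- c(12, 18, 25, 350, 9)    # mph; 350 is clearly a data error
speed_capped <- pmin(speed, 200)  # winsorize at the 200 mph threshold
```

Either choice affects very few rows here, so the downstream models should be largely insensitive to it.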
#Recheck the data structure
str(modeldata)
## 'data.frame': 1468452 obs. of 32 variables:
## $ VendorID : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ lpep_pickup_datetime : POSIXct, format: "2015-09-01 00:01:50" "2015-09-01 00:02:36" ...
## $ Lpep_dropoff_datetime: POSIXct, format: "2015-09-01 00:04:24" "2015-09-01 00:06:42" ...
## $ Store_and_fwd_flag : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ RateCodeID : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Pickup_longitude : num -73.9 -73.9 -74 -73.9 -73.9 ...
## $ Pickup_latitude : num 40.8 40.8 40.7 40.8 40.7 ...
## $ Dropoff_longitude : num -73.9 -73.9 -73.9 -73.9 -73.9 ...
## $ Dropoff_latitude : num 40.8 40.8 40.7 40.8 40.8 ...
## $ Passenger_count : int 1 1 1 1 1 1 1 1 2 1 ...
## $ Trip_distance : num 0.59 0.74 0.61 1.07 1.43 0.9 1.33 0.84 0.8 0.7 ...
## $ Fare_amount : num 4 5 5 5.5 6.5 5 6 5.5 5 4 ...
## $ Extra : Factor w/ 3 levels "0","0.5","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ MTA_tax : Factor w/ 2 levels "0","0.5": 2 2 2 2 2 2 2 2 2 2 ...
## $ Tip_amount : num 0.5 0 0 1.36 0 0 1.46 0 0 1.06 ...
## $ Tolls_amount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ improvement_surcharge: Factor w/ 2 levels "0","0.3": 2 2 2 2 2 2 2 2 2 2 ...
## $ Total_amount : num 5.8 6.3 6.3 8.16 7.8 6.3 8.76 6.8 6.3 6.36 ...
## $ Payment_type : Factor w/ 5 levels "1","2","3","4",..: 1 2 2 1 1 2 1 2 2 1 ...
## $ Trip_type : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tip_Per : num 8.62 0 0 16.67 0 ...
## $ pickup_hour : Factor w/ 24 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pickup_weekday : Factor w/ 7 levels "Friday","Monday",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ pickup_weekend : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
## $ dropoff_hour : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dropoff_weekday : Factor w/ 7 levels "Friday","Monday",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ dropoff_weekend : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
## $ week : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ County_Pickup : Factor w/ 6 levels "Bronx","Kings",..: 5 5 2 3 5 3 2 5 5 5 ...
## $ County_Dropoff : chr "Queens" "Queens" "Kings" "New York" ...
## $ Duration : num 2.57 4.1 4.1 4.68 4.97 ...
## $ Speed : num 13.79 10.83 8.93 13.71 17.28 ...
Create an indicator for whether the passenger gave a tip
modeldata$Tip_give<-as.factor(ifelse(modeldata$Tip_Per>0,'Yes','No'))
Data Exploration
Before data exploration, we define some helper functions for summarizing and visualizing continuous and categorical variables.
#Continuous summary function
Continous_stats <- function(x, detailed=FALSE){
options(scipen=100)
options(digits=2)
if (detailed){
var.stats <- stat.desc(x)
}else{
var.stats <- summary(x)
}
dta <- data.frame(round(as.numeric(var.stats),2))
colnames(dta) <- deparse(substitute(x))
rownames(dta) <- names(var.stats)
dta
}
#Continuous visualization function
#Histogram and density plots
Visual_Distribution <- function(var){
pl1 <- qplot(var, geom="histogram",
fill=I('Red'), binwidth=5,
col=I('Black'))+ theme_bw()
pl2 <- qplot(var, geom="density",
fill=I('Red'), binwidth=5,
col=I('Black'))+ theme_bw()
grid.arrange(pl1,pl2, ncol=2)
}
# Scatter plots
Visual_Scatter <- function(x, y){
pl3 <- qplot(x, y,geom='point',colour = I("red"),
xlab = deparse(substitute(x)),
ylab = deparse(substitute(y))) + theme_bw()
pl3
}
# box plots
Visual_Boxplot <- function(x, y){
pl1 <- qplot(factor(0),x, geom="boxplot",
xlab = deparse(substitute(x)),
ylab="values") + theme_bw()
pl2 <- qplot(y,x,geom="boxplot",
xlab = deparse(substitute(y)),
ylab = deparse(substitute(x))) + theme_bw()
grid.arrange(pl1,pl2, ncol=2)
}
##Categorical summary function
# summary statistics
Categorical_stats <- function(x){
feature.name = deparse(substitute(x))
dta1 <- data.frame(table(x))
colnames(dta1) <- c(feature.name, "Frequency")
dta2 <- data.frame(prop.table(table(x)))
colnames(dta2) <- c(feature.name, "Proportion")
dta <- merge(
dta1, dta2, by = feature.name
)
dta_final <- dta[order(-dta$Frequency),]
dta_final
}
# Contingency table
Contingency_table <- function(y, x, test=F){
if(test == F){
CrossTable(y, x, digits=2,
prop.r=F, prop.t=F, prop.chisq=F)
}else{
CrossTable(y, x, digits=2,
prop.r=F, prop.t=F, prop.chisq=F,
chisq=T, fisher=T)
}
}
# visualizations
# barcharts
Visual_Barchart <- function(x){
qplot(x, geom="bar",
fill=I('red'), col=I('black'),
xlab = deparse(substitute(x))) + theme_bw()
}
# mosaic plots
Visual_mosaic<- function(y, x){
mosaicplot(y ~ x, color=T,
main = "Contingency table plot")
}
Data exploration for continuous variables
attach(modeldata)
continuous_var<-c('Passenger_count','Trip_distance','Fare_amount','Tip_amount','Tolls_amount','Total_amount',
'Duration','Speed')
#Passenger_count
Visual_Barchart(Passenger_count)
Categorical_stats(Passenger_count)
Visual_Boxplot(Passenger_count,Tip_give)
Visual_Scatter(Passenger_count,Tip_Per)
#Trip_distance
Continous_stats(Trip_distance)
Visual_Distribution(Trip_distance)
Visual_Boxplot(Trip_distance,Tip_give)
Visual_Scatter(Trip_distance,Tip_Per)
#Fare_amount
Continous_stats(Fare_amount)
Visual_Distribution(Fare_amount)
Visual_Boxplot(Fare_amount,Tip_give)
Visual_Scatter(Fare_amount,Tip_Per)
#Tip_amount
Continous_stats(Tip_amount)
Visual_Distribution(Tip_amount)
Visual_Boxplot(Tip_amount,Tip_give)
Visual_Scatter(Tip_amount,Tip_Per)
#Tolls_amount
Continous_stats(Tolls_amount)
Visual_Distribution(Tolls_amount)
Visual_Boxplot(Tolls_amount,Tip_give)
Visual_Scatter(Tolls_amount,Tip_Per)
#Total_amount
Continous_stats(Total_amount)
Visual_Distribution(Total_amount)
Visual_Boxplot(Total_amount,Tip_give)
Visual_Scatter(Total_amount,Tip_Per)
#Duration
Continous_stats(Duration)
Visual_Distribution(Duration)
Visual_Boxplot(Duration,Tip_give)
Visual_Scatter(Duration,Tip_Per)
#Speed
Continous_stats(Speed)
Visual_Distribution(Speed)
Visual_Boxplot(Speed,Tip_give)
Visual_Scatter(Speed,Tip_Per)
#Correlation test for continuous variable and Tip_Per variable
for (var in continuous_var)
{
test <- cor.test(modeldata[,var], modeldata$Tip_Per) #run the test once and reuse the result
cor <- round(as.numeric(test$estimate),3)
p <- round(as.numeric(test$p.value),3)
print(paste0("The correlation between ",var," and Tip_per =",cor," with p value ",p))
}
## [1] "The correlation between Passenger_count and Tip_per =0.001 with p value 0.116"
## [1] "The correlation between Trip_distance and Tip_per =0.1 with p value 0"
## [1] "The correlation between Fare_amount and Tip_per =0.096 with p value 0"
## [1] "The correlation between Tip_amount and Tip_per =0.734 with p value 0"
## [1] "The correlation between Tolls_amount and Tip_per =0.044 with p value 0"
## [1] "The correlation between Total_amount and Tip_per =0.242 with p value 0"
## [1] "The correlation between Duration and Tip_per =-0.012 with p value 0"
## [1] "The correlation between Speed and Tip_per =0.057 with p value 0"
##Generate heatmap for continuous variables
dta_feature<-modeldata[,names(modeldata)%in%continuous_var]
bc <- cor(dta_feature)
corrplot(bc, order ="hclust")
By exploring the continuous variables, four main points were discovered:
1. Passenger_count shows a negligible correlation with Tip_Per and a large p value, so we do not consider it in the next steps. Tip_amount has a strong correlation with Tip_Per: the larger the tip amount, the higher the tip percentage. However, since Tip_Per is computed from the tip amount, this is leakage of future information, so we also exclude Tip_amount from the modeling stage.
2. Apart from Passenger_count and Tip_amount, the other variables all show non-linear relationships with Tip_Per, which is why their linear correlations are small. For Fare_amount, Tolls_amount, Speed and Duration, the scatter plots show Tip_Per decreasing as these variables increase and converging around 25%. This may imply that people do not tip proportionally more as trip length and duration increase. However, Tip_Per increases as Total_amount increases. The heatmap shows that Trip_distance, Fare_amount and Total_amount are highly correlated, which explains why Fare_amount, Tolls_amount and Total_amount have similar scatter plots and data patterns.
3. For Speed, one interesting discovery is that higher speed is associated with lower tips. A possible reason is that passengers feel less safe on high-speed trips and appreciate the service less.
4. Based on the boxplots, the values of these variables are not significantly different between the tip-given and no-tip groups.
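Since several of these relationships are non-linear, a rank-based measure such as Spearman's rho can complement the Pearson estimates above, because it captures monotonic rather than strictly linear association. A minimal sketch (illustrative only; the variable subset is an assumption, not part of the original analysis):

```r
#Spearman rank correlation is monotonic rather than linear, so it can pick up
#relationships that Pearson's r understates; illustrative sketch only
for (var in c('Trip_distance','Fare_amount','Total_amount','Duration','Speed')){
  rho <- cor(modeldata[,var], modeldata$Tip_Per, method = "spearman")
  print(paste("Spearman correlation between", var, "and Tip_Per:", round(rho, 3)))
}
```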
Data exploration for categorical variables
detach(modeldata)
categorical_var<-c('VendorID','Store_and_fwd_flag','RateCodeID','Extra','MTA_tax','improvement_surcharge',
'Payment_type','Trip_type','pickup_hour','pickup_weekday','pickup_weekend','week',
'County_Pickup')
attach(modeldata)
#VendorID
Categorical_stats (VendorID)
Contingency_table(Tip_give,VendorID,test=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | 1 | 2 | Row Total |
## -------------|-----------|-----------|-----------|
## No | 184515 | 685528 | 870043 |
## | 0.58 | 0.60 | |
## -------------|-----------|-----------|-----------|
## Yes | 133333 | 465076 | 598409 |
## | 0.42 | 0.40 | |
## -------------|-----------|-----------|-----------|
## Column Total | 317848 | 1150604 | 1468452 |
## | 0.22 | 0.78 | |
## -------------|-----------|-----------|-----------|
##
##
Visual_mosaic(Tip_give,VendorID)
Visual_Boxplot(Tip_Per,VendorID)
#Store_and_fwd_flag
Categorical_stats (Store_and_fwd_flag)
Contingency_table(Tip_give,Store_and_fwd_flag,test=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | N | Y | Row Total |
## -------------|-----------|-----------|-----------|
## No | 864840 | 5203 | 870043 |
## | 0.59 | 0.64 | |
## -------------|-----------|-----------|-----------|
## Yes | 595527 | 2882 | 598409 |
## | 0.41 | 0.36 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1460367 | 8085 | 1468452 |
## | 0.99 | 0.01 | |
## -------------|-----------|-----------|-----------|
##
##
Visual_mosaic(Tip_give,Store_and_fwd_flag)
Visual_Boxplot(Tip_Per,Store_and_fwd_flag)
#RateCodeID
Categorical_stats (RateCodeID)
Contingency_table(Tip_give,RateCodeID,test=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | 1 | 2 | 3 | 4 | 5 | 6 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## No | 847310 | 1847 | 480 | 509 | 19875 | 22 | 870043 |
## | 0.59 | 0.54 | 0.56 | 0.56 | 0.88 | 1.00 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Yes | 593253 | 1563 | 380 | 393 | 2820 | 0 | 598409 |
## | 0.41 | 0.46 | 0.44 | 0.44 | 0.12 | 0.00 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1440563 | 3410 | 860 | 902 | 22695 | 22 | 1468452 |
## | 0.98 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Visual_mosaic(Tip_give,RateCodeID)
Visual_Boxplot(Tip_Per,RateCodeID)
#Extra
Categorical_stats (Extra)
Contingency_table(Tip_give,Extra,test=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | 0 | 0.5 | 1 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## No | 401387 | 324222 | 144434 | 870043 |
## | 0.60 | 0.58 | 0.60 | |
## -------------|-----------|-----------|-----------|-----------|
## Yes | 263449 | 238737 | 96223 | 598409 |
## | 0.40 | 0.42 | 0.40 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 664836 | 562959 | 240657 | 1468452 |
## | 0.45 | 0.38 | 0.16 | |
## -------------|-----------|-----------|-----------|-----------|
##
##
Visual_mosaic(Tip_give,Extra)
Visual_Boxplot(Tip_Per,Extra)
#MTA_tax#
Categorical_stats (MTA_tax)
Contingency_table(Tip_give,MTA_tax,test=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | 0 | 0.5 | Row Total |
## -------------|-----------|-----------|-----------|
## No | 20407 | 849636 | 870043 |
## | 0.86 | 0.59 | |
## -------------|-----------|-----------|-----------|
## Yes | 3188 | 595221 | 598409 |
## | 0.14 | 0.41 | |
## -------------|-----------|-----------|-----------|
## Column Total | 23595 | 1444857 | 1468452 |
## | 0.02 | 0.98 | |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 7370 d.f. = 1 p = 0
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 7368 d.f. = 1 p = 0
##
##
## Fisher's Exact Test for Count Data
## ------------------------------------------------------------
## Sample estimate odds ratio: 4.5
##
## Alternative hypothesis: true odds ratio is not equal to 1
## p = 0
## 95% confidence interval: 4.3 4.7
##
## Alternative hypothesis: true odds ratio is less than 1
## p = 1
## 95% confidence interval: 0 4.6
##
## Alternative hypothesis: true odds ratio is greater than 1
## p = 0
## 95% confidence interval: 4.3 Inf
##
##
##
Visual_mosaic(Tip_give,MTA_tax)
Visual_Boxplot(Tip_Per,MTA_tax)
#improvement_surcharge#
Categorical_stats (improvement_surcharge)
Contingency_table(Tip_give,improvement_surcharge,test=T)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | 0 | 0.3 | Row Total |
## -------------|-----------|-----------|-----------|
## No | 20201 | 849842 | 870043 |
## | 0.87 | 0.59 | |
## -------------|-----------|-----------|-----------|
## Yes | 3089 | 595320 | 598409 |
## | 0.13 | 0.41 | |
## -------------|-----------|-----------|-----------|
## Column Total | 23290 | 1445162 | 1468452 |
## | 0.02 | 0.98 | |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 7406 d.f. = 1 p = 0
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 7405 d.f. = 1 p = 0
##
##
## Fisher's Exact Test for Count Data
## ------------------------------------------------------------
## Sample estimate odds ratio: 4.6
##
## Alternative hypothesis: true odds ratio is not equal to 1
## p = 0
## 95% confidence interval: 4.4 4.8
##
## Alternative hypothesis: true odds ratio is less than 1
## p = 1
## 95% confidence interval: 0 4.7
##
## Alternative hypothesis: true odds ratio is greater than 1
## p = 0
## 95% confidence interval: 4.4 Inf
##
##
##
Visual_mosaic(Tip_give,improvement_surcharge)
Visual_Boxplot(Tip_Per,improvement_surcharge)
#Trip_type#
Categorical_stats (Trip_type)
Contingency_table(Tip_give,Trip_type,test=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | 1 | 2 | Row Total |
## -------------|-----------|-----------|-----------|
## No | 850348 | 19695 | 870043 |
## | 0.59 | 0.88 | |
## -------------|-----------|-----------|-----------|
## Yes | 595628 | 2781 | 598409 |
## | 0.41 | 0.12 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1445976 | 22476 | 1468452 |
## | 0.98 | 0.02 | |
## -------------|-----------|-----------|-----------|
##
##
Visual_mosaic(Tip_give,Trip_type)
Visual_Boxplot(Tip_Per,Trip_type)
#pickup_hour#
Categorical_stats (pickup_hour)
Contingency_table(Tip_give,pickup_hour,test=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## No | 38014 | 31197 | 24415 | 19556 | 17596 | 10206 | 12197 | 23171 | 30601 | 32464 | 33039 | 34263 | 35317 | 35933 | 42636 | 46898 | 49152 | 53385 | 56286 | 54564 | 51806 | 47993 | 45493 | 43861 | 870043 |
## | 0.58 | 0.59 | 0.60 | 0.63 | 0.68 | 0.63 | 0.55 | 0.56 | 0.53 | 0.53 | 0.59 | 0.62 | 0.62 | 0.64 | 0.65 | 0.65 | 0.63 | 0.62 | 0.59 | 0.58 | 0.58 | 0.56 | 0.54 | 0.56 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Yes | 28076 | 21710 | 16012 | 11468 | 8239 | 6031 | 9832 | 17946 | 27295 | 28407 | 23246 | 21310 | 21278 | 20296 | 22634 | 25393 | 28498 | 33220 | 39466 | 40261 | 37674 | 37349 | 37989 | 34779 | 598409 |
## | 0.42 | 0.41 | 0.40 | 0.37 | 0.32 | 0.37 | 0.45 | 0.44 | 0.47 | 0.47 | 0.41 | 0.38 | 0.38 | 0.36 | 0.35 | 0.35 | 0.37 | 0.38 | 0.41 | 0.42 | 0.42 | 0.44 | 0.46 | 0.44 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 66090 | 52907 | 40427 | 31024 | 25835 | 16237 | 22029 | 41117 | 57896 | 60871 | 56285 | 55573 | 56595 | 56229 | 65270 | 72291 | 77650 | 86605 | 95752 | 94825 | 89480 | 85342 | 83482 | 78640 | 1468452 |
## | 0.05 | 0.04 | 0.03 | 0.02 | 0.02 | 0.01 | 0.02 | 0.03 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.05 | 0.05 | 0.06 | 0.07 | 0.06 | 0.06 | 0.06 | 0.06 | 0.05 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Visual_mosaic(Tip_give,pickup_hour)
Visual_Boxplot(Tip_Per,pickup_hour)
#pickup_weekday#
Categorical_stats (pickup_weekday)
Contingency_table(Tip_give,pickup_weekday,test=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | Friday | Monday | Saturday | Sunday | Thursday | Tuesday | Wednesday | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## No | 130142 | 98122 | 148920 | 128426 | 110621 | 123843 | 129969 | 870043 |
## | 0.60 | 0.60 | 0.58 | 0.59 | 0.59 | 0.60 | 0.59 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Yes | 88364 | 64112 | 106849 | 88724 | 77423 | 82239 | 90698 | 598409 |
## | 0.40 | 0.40 | 0.42 | 0.41 | 0.41 | 0.40 | 0.41 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 218506 | 162234 | 255769 | 217150 | 188044 | 206082 | 220667 | 1468452 |
## | 0.15 | 0.11 | 0.17 | 0.15 | 0.13 | 0.14 | 0.15 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Visual_mosaic(Tip_give,pickup_weekday)
Visual_Boxplot(Tip_Per,pickup_weekday)
#pickup_weekend#
Categorical_stats (pickup_weekend)
Contingency_table(Tip_give,pickup_weekend,test=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | Weekday | Weekend | Row Total |
## -------------|-----------|-----------|-----------|
## No | 592697 | 277346 | 870043 |
## | 0.60 | 0.59 | |
## -------------|-----------|-----------|-----------|
## Yes | 402836 | 195573 | 598409 |
## | 0.40 | 0.41 | |
## -------------|-----------|-----------|-----------|
## Column Total | 995533 | 472919 | 1468452 |
## | 0.68 | 0.32 | |
## -------------|-----------|-----------|-----------|
##
##
Visual_mosaic(Tip_give,pickup_weekend)
Visual_Boxplot(Tip_Per,pickup_weekend)
#week#
Categorical_stats (week)
Contingency_table(Tip_give,week,test=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | 1 | 2 | 3 | 4 | 5 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## No | 181470 | 212250 | 206143 | 194721 | 75459 | 870043 |
## | 0.62 | 0.59 | 0.58 | 0.58 | 0.58 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Yes | 112393 | 145561 | 147713 | 138519 | 54223 | 598409 |
## | 0.38 | 0.41 | 0.42 | 0.42 | 0.42 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 293863 | 357811 | 353856 | 333240 | 129682 | 1468452 |
## | 0.20 | 0.24 | 0.24 | 0.23 | 0.09 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Visual_mosaic(Tip_give,week)
Visual_Boxplot(Tip_Per,week)
#County_Pickup#
Categorical_stats (modeldata$County_Pickup)
Contingency_table(Tip_give,modeldata$County_Pickup,test=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | Bronx | Kings | New York | Other | Queens | Richmond | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## No | 66799 | 267050 | 250671 | 1006 | 284462 | 55 | 870043 |
## | 0.83 | 0.48 | 0.60 | 0.62 | 0.70 | 0.74 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Yes | 13700 | 293362 | 169647 | 604 | 121077 | 19 | 598409 |
## | 0.17 | 0.52 | 0.40 | 0.38 | 0.30 | 0.26 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 80499 | 560412 | 420318 | 1610 | 405539 | 74 | 1468452 |
## | 0.05 | 0.38 | 0.29 | 0.00 | 0.28 | 0.00 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Visual_mosaic(Tip_give,modeldata$County_Pickup)
Visual_Boxplot(Tip_Per,modeldata$County_Pickup)
#Payment_type#
Categorical_stats (Payment_type)
Contingency_table(Tip_give,Payment_type,test=F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1468452
##
##
## | x
## y | 1 | 2 | 3 | 4 | 5 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## No | 93970 | 769455 | 3314 | 3244 | 60 | 870043 |
## | 0.14 | 1.00 | 1.00 | 1.00 | 1.00 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Yes | 598406 | 2 | 0 | 1 | 0 | 598409 |
## | 0.86 | 0.00 | 0.00 | 0.00 | 0.00 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 692376 | 769457 | 3314 | 3245 | 60 | 1468452 |
## | 0.47 | 0.52 | 0.00 | 0.00 | 0.00 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Visual_mosaic(Tip_give,Payment_type)
Visual_Boxplot(Tip_Per,Payment_type)
for (var in categorical_var)
{
model <- lm(Tip_Per ~ modeldata[,var], data = modeldata) #univariate regression on each categorical variable
p1 <- summary(model)$coefficients[2,4] #p value of the first non-reference level
print(paste0("The p value of regression with ",var," and Tip_Per = ",p1))
}
## [1] "The p value of regression with VendorID and Tip_Per = 0.00000000000000000000000000000000126699033602445"
## [1] "The p value of regression with Store_and_fwd_flag and Tip_Per = 0.0000000000000000000185349404967839"
## [1] "The p value of regression with RateCodeID and Tip_Per = 0.00100410100506081"
## [1] "The p value of regression with Extra and Tip_Per = 1.7601148068309e-244"
## [1] "The p value of regression with MTA_tax and Tip_Per = 0"
## [1] "The p value of regression with improvement_surcharge and Tip_Per = 0"
## [1] "The p value of regression with Payment_type and Tip_Per = 0"
## [1] "The p value of regression with Trip_type and Tip_Per = 0"
## [1] "The p value of regression with pickup_hour and Tip_Per = 0.000261513563946544"
## [1] "The p value of regression with pickup_weekday and Tip_Per = 0.000000191197150785912"
## [1] "The p value of regression with pickup_weekend and Tip_Per = 0.00000000000000000000000186425185291282"
## [1] "The p value of regression with week and Tip_Per = 0.0000000000000000000000000000000000000000000000000000000000000000100197281205824"
## [1] "The p value of regression with County_Pickup and Tip_Per = 0"
As seen from the mosaic plot, Payment_type is the factor most strongly associated with Tip_give. Among trips paid by credit card, 86% included a tip; among trips not paid by credit card, essentially none did, and among trips with tips, ~100% were paid by credit card (Payment_type = 1). One possible explanation is that taxi drivers were more likely to report less in tips than what they actually received when paid in cash: cash payment is a way to evade tax, and the taxi company takes a share of the tip when the trip is paid by credit card. Thus Payment_type is a good feature for the root node of our decision tree. We create a new binary categorical variable payment_card to represent whether the trip was paid by credit card or not.
#create payment_card variable
modeldata$payment_card<-as.factor(ifelse(modeldata$Payment_type==1,'Yes','No'))
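A quick sanity check of the new variable against Tip_give (a minimal sketch; the base `table()` call is used here instead of the contingency helper defined earlier to keep the output compact):

```r
#Cross-tabulate the new indicator against tip-giving; nearly all tipped
#trips should fall in the payment_card == 'Yes' row
table(modeldata$payment_card, modeldata$Tip_give)
```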
Based on the data exploration, the variables with p value < 0.05 in the univariate analysis are chosen as candidate features for the modeling stage. The target variable is Tip_Per.
Candidate feature variable set
final_set<-c('Trip_distance','Fare_amount','Tolls_amount','Total_amount',
'Duration','Speed','VendorID','Store_and_fwd_flag','RateCodeID','Extra','MTA_tax',
'improvement_surcharge','payment_card','Trip_type','pickup_hour','pickup_weekday','pickup_weekend','week','County_Pickup')
final_modeldata<-modeldata[,c(final_set,'Tip_Per')]
rm(modeldata,rawdata)
backupdata<-final_modeldata
Normalize continuous variables
## normalizing - scaling
scale.features <- function(dta, variables){
for (variable in variables){
dta[[variable]] <- scale(dta[[variable]], center=T, scale=T)
}
return(dta)
}
continuous_var<-c('Trip_distance','Fare_amount','Tolls_amount','Total_amount','Duration','Speed')
final_modeldata <- scale.features(final_modeldata, continuous_var)
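Note that scaling here is applied to the full dataset before the train/test split, so test-set statistics leak into the scaling parameters. At this sample size the effect is typically negligible, but a stricter variant would fit the scaler on the training rows only and reuse those parameters on the test rows. A hedged sketch (the function name and structure are illustrative, not part of the original pipeline):

```r
#Stricter variant of scale.features: compute center/scale on the training
#rows only, then apply the same parameters to the test rows
scale.features.train_test <- function(train, test, variables){
  for (variable in variables){
    mu <- mean(train[[variable]]); sdev <- sd(train[[variable]])
    train[[variable]] <- (train[[variable]] - mu) / sdev
    test[[variable]]  <- (test[[variable]]  - mu) / sdev
  }
  list(train = train, test = test)
}
```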
Generate training/testing dataset and check their data pattern
# split data into training and test datasets in 80:20 ratio
set.seed(1234)
indexes <- sample(1:nrow(final_modeldata), size=0.8*nrow(final_modeldata))
train_data <- final_modeldata[indexes,]
test_data <- final_modeldata[-indexes,]
#Check data pattern using PCA analysis
train_feature<-train_data[,c('Trip_distance','Fare_amount','Tolls_amount','Total_amount',
'Duration','Speed','Tip_Per')]
test_feature<-test_data[,c('Trip_distance','Fare_amount','Tolls_amount','Total_amount',
'Duration','Speed','Tip_Per')]
pca <- prcomp(train_feature,
scale. = TRUE)
a1<-autoplot(pca,data = train_feature)+ggtitle("Training dataset")
pca1 <- prcomp(test_feature,scale. = TRUE)
a2<-autoplot(pca1,data = test_feature)+ggtitle("Testing dataset")
grid.arrange(a1,a2, ncol=2)
rm(train_feature,test_feature)
As seen from the PCA biplots, the training and test datasets have similar patterns. There are 1,174,761 trips in the training dataset and 293,691 trips in the test dataset. For model comparison, we will use MSE (mean squared error) and the correlation between observed and predicted values as the two main metrics of model performance. On the test data, the MSE is also compared with the variance of Tip_Per: if the MSE is smaller than the variance of Tip_Per in the test dataset, the predictive model is considered a good model.
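These two metrics can be wrapped in a small helper so that every model is scored the same way. A sketch only; `evaluate_model` is not part of the original code:

```r
#Score predictions with the two metrics used throughout this report: MSE and
#the Pearson correlation between observed and predicted values, plus the
#variance of the observed target as the reference baseline
evaluate_model <- function(observed, predicted){
  mse <- mean((observed - predicted)^2)
  r <- as.numeric(cor.test(predicted, observed)$estimate)
  c(MSE = mse, Correlation = r, Variance = var(observed))
}
```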
Generate baseline model: stepwise regression using the entire dataset. We first fit a linear regression on the entire training dataset. A stepwise algorithm is applied here for feature selection.
#Full<-lm(Tip_Per~.,data=train_data)
#Null<-lm(Tip_Per~1,data=train_data)
#step(Null, scope = list(upper=Full), data=train, direction="both")
#Note: we ran the code above to obtain the final model; to save running time, only the final model is shown
#Final model
base<-lm(formula = Tip_Per ~ payment_card + County_Pickup + Total_amount +
Fare_amount + Tolls_amount + Extra + RateCodeID + pickup_hour +
Duration + Trip_distance + Speed + MTA_tax + VendorID + week +
pickup_weekday + improvement_surcharge + Trip_type, data = train_data)
summary(base)
##
## Call:
## lm(formula = Tip_Per ~ payment_card + County_Pickup + Total_amount +
## Fare_amount + Tolls_amount + Extra + RateCodeID + pickup_hour +
## Duration + Trip_distance + Speed + MTA_tax + VendorID + week +
## pickup_weekday + improvement_surcharge + Trip_type, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513.2 -1.2 -0.2 1.9 82.6
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 5.277537 0.302678 17.44
## payment_cardYes 9.417827 0.008929 1054.76
## County_PickupKings 1.337144 0.016883 79.20
## County_PickupNew York 0.676215 0.016985 39.81
## County_PickupOther 0.551320 0.109539 5.03
## County_PickupQueens 0.937773 0.017036 55.05
## County_PickupRichmond -1.515093 0.490233 -3.09
## Total_amount 21.742103 0.023222 936.28
## Fare_amount -20.415433 0.023354 -874.18
## Tolls_amount -2.002756 0.004443 -450.81
## Extra0.5 -1.236876 0.032872 -37.63
## Extra1 -2.261774 0.016980 -133.20
## RateCodeID2 1.448939 0.076740 18.88
## RateCodeID3 -2.132177 0.309385 -6.89
## RateCodeID4 1.309828 0.146538 8.94
## RateCodeID5 -0.272652 0.147916 -1.84
## RateCodeID6 -1.407984 0.909400 -1.55
## pickup_hour1 -0.012798 0.025184 -0.51
## pickup_hour2 -0.031492 0.027246 -1.16
## pickup_hour3 -0.065209 0.029717 -2.19
## pickup_hour4 -0.157574 0.031710 -4.97
## pickup_hour5 -0.282029 0.038049 -7.41
## pickup_hour6 -0.378495 0.045864 -8.25
## pickup_hour7 -0.405129 0.042226 -9.59
## pickup_hour8 -0.280577 0.040731 -6.89
## pickup_hour9 -0.179145 0.040468 -4.43
## pickup_hour10 -0.108404 0.040728 -2.66
## pickup_hour11 -0.182564 0.040802 -4.47
## pickup_hour12 -0.133622 0.040742 -3.28
## pickup_hour13 -0.171270 0.040759 -4.20
## pickup_hour14 -0.201150 0.040231 -5.00
## pickup_hour15 -0.194052 0.039825 -4.87
## pickup_hour16 -0.153876 0.039240 -3.92
## pickup_hour17 -0.115153 0.038950 -2.96
## pickup_hour18 -0.020400 0.038668 -0.53
## pickup_hour19 0.000306 0.037715 0.01
## pickup_hour20 0.116619 0.022221 5.25
## pickup_hour21 0.119135 0.022423 5.31
## pickup_hour22 0.121130 0.022514 5.38
## pickup_hour23 0.039273 0.022793 1.72
## Duration -0.073994 0.003668 -20.17
## Trip_distance -0.225260 0.010589 -21.27
## Speed 0.055924 0.004572 12.23
## MTA_tax0.5 -2.224119 0.269293 -8.26
## VendorID2 -0.080048 0.008684 -9.22
## week2 0.023689 0.010847 2.18
## week3 0.040601 0.010858 3.74
## week4 0.084885 0.011024 7.70
## week5 0.089426 0.015560 5.75
## pickup_weekdayMonday 0.032330 0.014759 2.19
## pickup_weekdaySaturday 0.075189 0.013241 5.68
## pickup_weekdaySunday 0.099041 0.013812 7.17
## pickup_weekdayThursday 0.026660 0.013576 1.96
## pickup_weekdayTuesday 0.032353 0.013569 2.38
## pickup_weekdayWednesday 0.029390 0.013341 2.20
## improvement_surcharge0.3 -0.915731 0.146483 -6.25
## Trip_type2 -1.292672 0.266413 -4.85
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## payment_cardYes < 0.0000000000000002 ***
## County_PickupKings < 0.0000000000000002 ***
## County_PickupNew York < 0.0000000000000002 ***
## County_PickupOther 0.000000482711630 ***
## County_PickupQueens < 0.0000000000000002 ***
## County_PickupRichmond 0.00200 **
## Total_amount < 0.0000000000000002 ***
## Fare_amount < 0.0000000000000002 ***
## Tolls_amount < 0.0000000000000002 ***
## Extra0.5 < 0.0000000000000002 ***
## Extra1 < 0.0000000000000002 ***
## RateCodeID2 < 0.0000000000000002 ***
## RateCodeID3 0.000000000005517 ***
## RateCodeID4 < 0.0000000000000002 ***
## RateCodeID5 0.06529 .
## RateCodeID6 0.12156
## pickup_hour1 0.61132
## pickup_hour2 0.24775
## pickup_hour3 0.02821 *
## pickup_hour4 0.000000672173486 ***
## pickup_hour5 0.000000000000124 ***
## pickup_hour6 < 0.0000000000000002 ***
## pickup_hour7 < 0.0000000000000002 ***
## pickup_hour8 0.000000000005638 ***
## pickup_hour9 0.000009561496699 ***
## pickup_hour10 0.00778 **
## pickup_hour11 0.000007663748382 ***
## pickup_hour12 0.00104 **
## pickup_hour13 0.000026451429748 ***
## pickup_hour14 0.000000573838639 ***
## pickup_hour15 0.000001101488873 ***
## pickup_hour16 0.000088038664071 ***
## pickup_hour17 0.00311 **
## pickup_hour18 0.59780
## pickup_hour19 0.99352
## pickup_hour20 0.000000153616482 ***
## pickup_hour21 0.000000107852000 ***
## pickup_hour22 0.000000074419589 ***
## pickup_hour23 0.08488 .
## Duration < 0.0000000000000002 ***
## Trip_distance < 0.0000000000000002 ***
## Speed < 0.0000000000000002 ***
## MTA_tax0.5 < 0.0000000000000002 ***
## VendorID2 < 0.0000000000000002 ***
## week2 0.02896 *
## week3 0.00018 ***
## week4 0.000000000000014 ***
## week5 0.000000009071861 ***
## pickup_weekdayMonday 0.02849 *
## pickup_weekdaySaturday 0.000000013595042 ***
## pickup_weekdaySunday 0.000000000000746 ***
## pickup_weekdayThursday 0.04956 *
## pickup_weekdayTuesday 0.01711 *
## pickup_weekdayWednesday 0.02759 *
## improvement_surcharge0.3 0.000000000406787 ***
## Trip_type2 0.000001221586467 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.9 on 1174704 degrees of freedom
## Multiple R-squared: 0.805, Adjusted R-squared: 0.805
## F-statistic: 8.65e+04 on 56 and 1174704 DF, p-value: <0.0000000000000002
predict_base <- predict(base, newdata = test_data)
#MSE
MSE_base<-sum((test_data$Tip_Per-predict_base)^2)/dim(test_data)[1]
#Correlation
cor_base<-as.numeric(cor.test(predict_base,test_data$Tip_Per)$estimate)
#variance of Tip_Per in test dataset
var_test<-var(test_data$Tip_Per)
Generate the tree-based regression model As discussed, we first split the train dataset into two subgroups using the variable payment_card. For the payment_card=='No' group, we predict zero tip for every trip. Linear-type models are fit only on the payment_card=='Yes' group, where two different regression models are considered: stepwise linear regression and penalized regression.
Note: random forest, gradient boosting, and an ensemble method (the mean of the predictions from the random forest and the gradient boosting) were also considered here, but their performances are very similar to those of the stepwise and penalized regressions. To avoid a long wait when generating the final notebook, I skip their results here. These two commonly used methods can be trained with the train() function in R (caret) using methods 'rf' and 'xgbTree'. The main purpose of the modeling part is still to demonstrate the advantage of the tree-based framework for predicting a continuous variable.
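The ensemble mentioned above is simply the element-wise mean of the two models' predictions. A minimal sketch with made-up prediction vectors (the values are illustrative, not from the fitted models):

```r
# Hypothetical prediction vectors standing in for the two ensemble members
pred_rf  <- c(10.2, 15.1, 0.0)  # illustrative random forest predictions
pred_xgb <- c(11.0, 14.7, 0.4)  # illustrative gradient boosting predictions

# Ensemble prediction: element-wise mean of the two vectors
pred_ensemble <- (pred_rf + pred_xgb) / 2
pred_ensemble
## [1] 10.6 14.9  0.2
```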
#Split the train/test dataset using payment_card
sub_train1<-train_data[train_data$payment_card=='No',]
sub_train2<-train_data[train_data$payment_card=='Yes',]
sub_train2<-sub_train2%>%select(-payment_card)
sub_test1<-test_data[test_data$payment_card=='No',]
sub_test2<-test_data[test_data$payment_card=='Yes',]
sub_test2<-sub_test2%>%select(-payment_card)
#Check Tip_Per in sub_train1
mean(sub_train1$Tip_Per)
## [1] 0.000054
The mean of Tip_Per in the payment_card=='No' group is 5.37e-05, essentially zero, so we predict 0 for every trip in this group. We train our models on the payment_card=='Yes' group (sub_train2).
Stepwise regression
#Full<-lm(Tip_Per~.,data=sub_train2)
#Null<-lm(Tip_Per~1,data=sub_train2)
#reg1<-step(Null, scope = list(upper=Full), data=sub_train2, direction="both")
#Note:we run the codes above to get final model, to save running time, I only show the final model
reg1<-lm(formula = Tip_Per ~ County_Pickup + Fare_amount + Total_amount +
Tolls_amount + Extra + Duration + RateCodeID + pickup_hour +
VendorID + Trip_distance + Speed + week + pickup_weekday +
improvement_surcharge, data = sub_train2)
summary(reg1)
##
## Call:
## lm(formula = Tip_Per ~ County_Pickup + Fare_amount + Total_amount +
## Tolls_amount + Extra + Duration + RateCodeID + pickup_hour +
## VendorID + Trip_distance + Speed + week + pickup_weekday +
## improvement_surcharge, data = sub_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -662 -1 1 2 130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.68737 0.23750 45.00 < 0.0000000000000002
## County_PickupKings 3.19833 0.03794 84.31 < 0.0000000000000002
## County_PickupNew York 2.24383 0.03845 58.35 < 0.0000000000000002
## County_PickupOther 1.83682 0.20041 9.17 < 0.0000000000000002
## County_PickupQueens 2.78708 0.03920 71.10 < 0.0000000000000002
## County_PickupRichmond -3.07703 1.04625 -2.94 0.00327
## Fare_amount -27.42167 0.03730 -735.25 < 0.0000000000000002
## Total_amount 27.33052 0.03287 831.41 < 0.0000000000000002
## Tolls_amount -2.54203 0.00687 -369.84 < 0.0000000000000002
## Extra0.5 -1.63377 0.06478 -25.22 < 0.0000000000000002
## Extra1 -3.19261 0.03189 -100.12 < 0.0000000000000002
## Duration -0.15752 0.00724 -21.76 < 0.0000000000000002
## RateCodeID2 0.18137 0.13976 1.30 0.19437
## RateCodeID3 2.12521 0.27698 7.67 0.000000000000017
## RateCodeID4 1.01859 0.26147 3.90 0.000097954421453
## RateCodeID5 0.34973 0.23061 1.52 0.12937
## pickup_hour1 -0.03846 0.04601 -0.84 0.40323
## pickup_hour2 -0.07251 0.05043 -1.44 0.15045
## pickup_hour3 -0.16188 0.05619 -2.88 0.00396
## pickup_hour4 -0.32044 0.06330 -5.06 0.000000414117809
## pickup_hour5 -0.38972 0.07235 -5.39 0.000000071784818
## pickup_hour6 -0.33975 0.08627 -3.94 0.000082062828354
## pickup_hour7 -0.44644 0.08054 -5.54 0.000000029670449
## pickup_hour8 -0.23468 0.07769 -3.02 0.00252
## pickup_hour9 -0.08295 0.07739 -1.07 0.28380
## pickup_hour10 0.01345 0.07862 0.17 0.86413
## pickup_hour11 -0.06986 0.07920 -0.88 0.37774
## pickup_hour12 -0.03854 0.07920 -0.49 0.62653
## pickup_hour13 -0.08931 0.07943 -1.12 0.26088
## pickup_hour14 -0.17690 0.07866 -2.25 0.02451
## pickup_hour15 -0.16775 0.07772 -2.16 0.03091
## pickup_hour16 -0.11708 0.07622 -1.54 0.12452
## pickup_hour17 -0.01781 0.07544 -0.24 0.81335
## pickup_hour18 0.09174 0.07472 1.23 0.21952
## pickup_hour19 0.13643 0.07271 1.88 0.06058
## pickup_hour20 0.16267 0.04052 4.01 0.000059517590792
## pickup_hour21 0.14307 0.04057 3.53 0.00042
## pickup_hour22 0.17352 0.04039 4.30 0.000017400727169
## pickup_hour23 0.07612 0.04102 1.86 0.06349
## VendorID2 -0.18944 0.01612 -11.75 < 0.0000000000000002
## Trip_distance -0.28771 0.02100 -13.70 < 0.0000000000000002
## Speed 0.10831 0.00871 12.43 < 0.0000000000000002
## week2 0.05389 0.02036 2.65 0.00813
## week3 0.08064 0.02029 3.97 0.000070558082848
## week4 0.12372 0.02062 6.00 0.000000001973734
## week5 0.13904 0.02877 4.83 0.000001344640481
## pickup_weekdayMonday 0.02567 0.02750 0.93 0.35058
## pickup_weekdaySaturday 0.09650 0.02442 3.95 0.000077359454677
## pickup_weekdaySunday 0.12593 0.02566 4.91 0.000000920376700
## pickup_weekdayThursday 0.06087 0.02501 2.43 0.01493
## pickup_weekdayTuesday 0.05540 0.02525 2.19 0.02821
## pickup_weekdayWednesday 0.08500 0.02467 3.45 0.00057
## improvement_surcharge0.3 -0.89395 0.22294 -4.01 0.000060743889936
##
## (Intercept) ***
## County_PickupKings ***
## County_PickupNew York ***
## County_PickupOther ***
## County_PickupQueens ***
## County_PickupRichmond **
## Fare_amount ***
## Total_amount ***
## Tolls_amount ***
## Extra0.5 ***
## Extra1 ***
## Duration ***
## RateCodeID2
## RateCodeID3 ***
## RateCodeID4 ***
## RateCodeID5
## pickup_hour1
## pickup_hour2
## pickup_hour3 **
## pickup_hour4 ***
## pickup_hour5 ***
## pickup_hour6 ***
## pickup_hour7 ***
## pickup_hour8 **
## pickup_hour9
## pickup_hour10
## pickup_hour11
## pickup_hour12
## pickup_hour13
## pickup_hour14 *
## pickup_hour15 *
## pickup_hour16
## pickup_hour17
## pickup_hour18
## pickup_hour19 .
## pickup_hour20 ***
## pickup_hour21 ***
## pickup_hour22 ***
## pickup_hour23 .
## VendorID2 ***
## Trip_distance ***
## Speed ***
## week2 **
## week3 ***
## week4 ***
## week5 ***
## pickup_weekdayMonday
## pickup_weekdaySaturday ***
## pickup_weekdaySunday ***
## pickup_weekdayThursday *
## pickup_weekdayTuesday *
## pickup_weekdayWednesday ***
## improvement_surcharge0.3 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.9 on 554135 degrees of freedom
## Multiple R-squared: 0.57, Adjusted R-squared: 0.57
## F-statistic: 1.41e+04 on 52 and 554135 DF, p-value: <0.0000000000000002
predict_step <- predict(reg1, newdata = sub_test2)
importance <- varImp(reg1,scale = FALSE)
importance
#MSE
MSE_step<-(sum((sub_test2$Tip_Per-predict_step)^2)+sum((sub_test1$Tip_Per)^2))/dim(test_data)[1]
#Correlation
pre<-c(rep(0,dim(sub_test1)[1]),predict_step)
dta<-rbind(sub_test1[,!names(sub_test1)%in%c('payment_card')],sub_test2)
cor_step<-as.numeric(cor.test(pre,dta$Tip_Per)$estimate)
rm(pre,dta)
Penalized regression
set.seed(1234)
formula.init <- "Tip_Per~."
formula.init <- as.formula(formula.init)
#5-cross validation,repeat 2 times
control <- trainControl(method="repeatedcv", number=5, repeats=2)
model <- train(formula.init, data=sub_train2, method="glmnet",trControl=control)
#Get optimal parameter: alpha=0.55,s=0.0016
train_mat<-model.matrix(Tip_Per~.,sub_train2)[,1:18]
test_mat<-model.matrix(Tip_Per~.,sub_test2)[,1:18]
temp<-glmnet(train_mat, sub_train2[,19],alpha=0.55)
predict_penal<-predict(temp,newx=test_mat,type='response', s=0.0016)
importance <- varImp(model,scale=FALSE)
importance
## glmnet variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## Fare_amount 27.130
## Total_amount 27.061
## County_PickupKings 3.170
## Extra1 3.146
## County_PickupRichmond 2.876
## County_PickupQueens 2.754
## Tolls_amount 2.514
## County_PickupNew York 2.208
## County_PickupOther 1.793
## RateCodeID3 1.674
## Extra0.5 1.545
## RateCodeID4 0.931
## improvement_surcharge0.3 0.791
## pickup_hour5 0.372
## pickup_hour7 0.368
## MTA_tax0.5 0.331
## pickup_hour4 0.313
## Trip_distance 0.307
## pickup_hour6 0.258
## pickup_hour19 0.193
#MSE
MSE_penal<-(sum((sub_test2$Tip_Per-predict_penal)^2)+sum((sub_test1$Tip_Per)^2))/dim(test_data)[1]
#Correlation
pre<-c(rep(0,dim(sub_test1)[1]),predict_penal)
dta<-rbind(sub_test1[,!names(sub_test1)%in%c('payment_card')],sub_test2)
cor_penal<-as.numeric(cor.test(pre,dta$Tip_Per)$estimate)
rm(pre,dta)
From both the stepwise and the penalized regression, the main factors contributing to the prediction are Total_amount, Fare_amount, Tolls_amount, and pickup county (i.e. whether the pickup location is in Kings County or not). The model results indicate that higher fare and tolls amounts lead to a lower tip percentage. In addition, pickups in Kings County have a higher tip percentage in general.
For model comparison, the variance of Tip_Per in the test dataset is 76.43. The MSE of the baseline regression fit on the entire train dataset (MSE_base) is 15.54, and the correlation between observed and predicted tip percentage (%) (cor_base) is 0.893. The MSE of the tree-based stepwise regression model (MSE_step) is 12.13 with a correlation of 0.917, and the MSE of the tree-based penalized regression model (MSE_penal) is 12.32 with a correlation of 0.916. The tree-based stepwise regression has the lowest MSE and the highest correlation; among these three models it is the optimal one, although it performs slightly worse than the ensemble methods (not shown here due to computing time).
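The reported metrics can be collected into a single table for a side-by-side view (values copied from the results above):

```r
# Comparison table built from the metrics reported above
comparison <- data.frame(
  model = c("baseline lm", "tree-based stepwise", "tree-based penalized"),
  MSE   = c(15.54, 12.13, 12.32),
  cor   = c(0.893, 0.917, 0.916))
# Order by MSE: the tree-based stepwise model comes out on top
comparison[order(comparison$MSE), ]
```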
Discussion Compared to a traditional single model, this integrative tree-based approach usually performs better in robustness and stability for deployment. In this case study, whether a credit card was used is our tree node, and we adopted a different modeling strategy for each of the two subgroups. This tree-based model performed much better than the traditional model trained on the entire train dataset.
In addition, this tree-based framework can be easily explained to an audience without much data science knowledge. It can be applied to risk stratification and continuous prediction in fields that require strong domain knowledge, such as healthcare, finance, and oil & energy.
The general strategy is to build a shallow decision tree before modeling. In my experience, one to three layers is appropriate, depending on the sample size. Then, under each node, we can build separate models or criteria. The last step is to combine the results from the subgroups split by the tree nodes.
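A minimal sketch of this split-then-model-then-combine pattern on a toy dataset (the toy data, the card flag, and the constant-zero rule for the 'No' node are all illustrative, mirroring the payment_card split above):

```r
set.seed(1)
# Toy data: a binary tree node 'card' and one predictor 'x'
toy <- data.frame(card = rep(c("No", "Yes"), each = 50), x = runif(100))
toy$y <- ifelse(toy$card == "No", 0, 10 + 5 * toy$x) + rnorm(100, sd = 0.5)

# Node 'No': a simple rule (predict the constant 0)
# Node 'Yes': a separate regression model
fit_yes <- lm(y ~ x, data = toy[toy$card == "Yes", ])

# Combine: route each observation to its node's model
pred <- ifelse(toy$card == "No", 0, predict(fit_yes, newdata = toy))
```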
Question 5(Option A):
We already created the speed and week-of-month variables in Question 4. Here we first run an ANOVA to test the null hypothesis that average trip speeds are materially the same in all weeks of September, i.e. an ANOVA between average speed and the week of September (1~5).
##Anova test###
out<-lm(Speed~week,backupdata)
anova(out)
##Boxplot for week
p <- ggplot(backupdata, aes(x=week, y=Speed)) + theme_bw()+geom_boxplot()
p
#Tukey test
tukey.test <- TukeyHSD(aov(out))
tukey.test
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = out)
##
## $week
## diff lwr upr p adj
## 2-1 -0.6931 -0.737 -0.649 0.00
## 3-1 -0.6837 -0.728 -0.640 0.00
## 4-1 -0.2052 -0.250 -0.161 0.00
## 5-1 -0.8704 -0.929 -0.812 0.00
## 3-2 0.0094 -0.032 0.051 0.97
## 4-2 0.4879 0.445 0.530 0.00
## 5-2 -0.1773 -0.234 -0.120 0.00
## 4-3 0.4785 0.436 0.521 0.00
## 5-3 -0.1867 -0.244 -0.130 0.00
## 5-4 -0.6652 -0.723 -0.608 0.00
#Visualize the 95% CI
plot(tukey.test )
The ANOVA test yields a large F-value and a small p-value, so we reject the null hypothesis and conclude that the week of the month is related to the average speed. We then run Tukey's test for multiple group comparisons; Tukey's test compares the mean of each group to the mean of every other group. From the Tukey results, average speeds show no significant difference between Week 2 and Week 3 (adjusted p-value = 0.97). All other pairs of weeks have very small adjusted p-values, so we reject the null hypothesis for them: their average speeds are significantly different.
Similarly, we first check whether speed and time of day are significantly associated using an ANOVA test. The result below shows that they are.
##Anova test###
out<-lm(Speed~pickup_hour,backupdata)
anova(out)
#Histogram#
ggplotly(backupdata %>% group_by(pickup_hour) %>% summarise(ave_speed=mean(Speed)) %>%
ggplot(aes(pickup_hour, ave_speed, fill = ave_speed)) + geom_col() +
geom_label(aes(label=round(ave_speed,2)), size=3.5, alpha=.7) +
# coord_flip() +
#scale_x_continuous(breaks=seq(1,24,1)) +
theme_bw() +
theme(legend.position = 'none') +
labs(title='Average Speed by Time of Day',y="Average Speed", x="Time of Day (Pickup)"))
As the figure above shows, average speed clearly increases until it peaks at approximately 5:00 to 6:00 AM, while the minimum average speed occurs around the evening work-to-home rush. The average speed varies significantly over the hours of the day: in the early morning, before work hours, traffic is light and speeds are high; speed drops significantly during the busy daytime and rises again at night.